Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make cudf.pandas proxy array picklable #17929

Merged

Conversation

Matt711
Copy link
Contributor

@Matt711 Matt711 commented Feb 6, 2025

Description

Apart of #17490.

We employ custom pickling logic for our cudf.pandas wrapped types. The logic lets us serialize and de-serialize wrapped types by serializing and de-serializing the underlying wrapped types (ie. the type of _fsproxy_wrapped). This pickling logic is defined in _FinalProxy, which is the base class of all of our "final" proxy types.

The failures in the integration tests occurred because this pickling logic wasn't used for the proxy numpy array type. This is because the "final" proxy array type inherits from an additional base class: ProxyNDarrayBase (which contains logic to inherit from np.ndarray ). And it comes before _FinalProxy in the classes MRO, so the custom pickling is not used.

Additionally, the custom pickling logic used for other proxy types is incompatible with our proxy array. So this PR defines a custom function for handling proxy array serialization.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@Matt711 Matt711 requested a review from a team as a code owner February 6, 2025 15:51
@Matt711 Matt711 requested review from vyasr and mroeschke February 6, 2025 15:51
@Matt711 Matt711 added the bug Something isn't working label Feb 6, 2025
@github-actions github-actions bot added Python Affects Python cuDF API. cudf.pandas Issues specific to cudf.pandas labels Feb 6, 2025
@Matt711 Matt711 added the non-breaking Non-breaking change label Feb 6, 2025
@Matt711
Copy link
Contributor Author

Matt711 commented Feb 6, 2025

I'll continue once #17936 is merged. That will help ensure that the tests are passing with the changes in this PR.

@Matt711 Matt711 changed the title Fix numpy-proxying failures in the third-party integrations tests Make cudf.pandas proxy types explictly call our custom pickling logic Feb 9, 2025
@Matt711
Copy link
Contributor Author

Matt711 commented Feb 12, 2025

/ok to test

@Matt711 Matt711 requested a review from mroeschke February 13, 2025 01:33
Copy link
Contributor

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This problem also makes me wonder what other methods our proxy numpy array is getting from np.ndarray instead of our _FinalProxy because the np.ndarray is first in the MRO, but that's another problem for another day

Copy link
Contributor Author

@Matt711 Matt711 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this isn't quite ready to merge yet. Looking into the failures now.

Edit: I'll request another review from you @mroeschke when its ready

@Matt711 Matt711 added the DO NOT MERGE Hold off on merging; see PR for details label Feb 13, 2025
@Matt711 Matt711 changed the title Make cudf.pandas proxy types explictly call our custom pickling logic Make cudf.pandas proxy types picklable Feb 13, 2025
@Matt711 Matt711 changed the title Make cudf.pandas proxy types picklable Make cudf.pandas proxy array picklable Feb 13, 2025
@Matt711 Matt711 removed the DO NOT MERGE Hold off on merging; see PR for details label Feb 13, 2025
@Matt711 Matt711 requested a review from mroeschke February 13, 2025 15:31
@Matt711
Copy link
Contributor Author

Matt711 commented Feb 13, 2025

This problem also makes me wonder what other methods our proxy numpy array is getting from np.ndarray instead of our _FinalProxy because the np.ndarray is first in the MRO, but that's another problem for another day

Thanks, we probably okay now. The methods that ndarray inherits are

  1. _fsproxy_fast_to_slow
  2. _fsproxy_slow_to_fast
  3. _fsproxy_fast
  4. _fsproxy_slow
  5. __dir__
  6. __setattr__
  7. _fsproxy_wrap
  8. __reduce__
  9. __setstate__

They we're all accounted for except the two used for pickling.

@Matt711
Copy link
Contributor Author

Matt711 commented Feb 13, 2025

/merge

@rapids-bot rapids-bot bot merged commit 9ead47b into rapidsai:branch-25.04 Feb 13, 2025
109 checks passed
Comment on lines +1984 to +1996
def test_pickle_round_trip_proxy_numpy_array(array):
arr, proxy_arr = array
pickled_arr = BytesIO()
pickled_proxy_arr = BytesIO()
pickle.dump(arr, pickled_arr)
pickle.dump(proxy_arr, pickled_proxy_arr)

pickled_arr.seek(0)
pickled_proxy_arr.seek(0)

np.testing.assert_equal(
pickle.load(pickled_proxy_arr), pickle.load(pickled_arr)
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: This can be simplified a bit, I think. I'll tack it on to another PR.

def test_pickle_round_trip_proxy_numpy_array(array):
    arr, proxy_arr = array
    np.testing.assert_equal(
        pickle.loads(pickle.dumps(proxy_arr)), 
        pickle.loads(pickle.dumps(arr))
    )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cudf.pandas Issues specific to cudf.pandas non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants